Abstract: In this paper, we present an information theoretic
analysis of the blind signal classification algorithm. We show
that the algorithm is equivalent to a Maximum A Posteriori
(MAP) estimator based on estimated parametric probability
models. We prove a lower bound on the error exponents of
the parametric model estimation. It is shown that the estimated
model parameters converge in probability to the true model
parameters, except for some small bias terms.

I. Introduction 

In this paper, we consider blind signal classification
problems. In the considered scenarios, a sequence of random
signal samples $X_1, X_2, \ldots$ is observed, where each signal
sample is a real number or a vector in a finite dimensional
space. It is assumed that the signal samples are generated
by $J$ information sources with different statistical properties.
However, it is unknown from which information source each
signal sample $X_n$ is emitted. Blind signal classification
is the problem of estimating the membership of each signal
sample to the information sources. Such problems find
applications in many areas of image processing, computer
vision and machine learning, for example, in image
segmentation and cluster analysis. For background information
on these applications, we refer interested readers to [1], [2]
and references therein.

In [3], a novel algorithm for signal classification problems
is proposed based on data compression. The algorithm
is based on the intuitive idea that optimal classification
induces optimal adaptive data compression. Therefore, signal
classification problems can be formulated as optimization
problems. An analysis from an algorithmic viewpoint was also
presented. It was shown in [3] that a soft membership
relaxation can be used to reduce the computational complexity
with asymptotically vanishing optimality loss. Simulation
results show that the algorithm performs well.

It is well known that there exist close connections between
information theory and statistical inference. In particular,
source coding and data compression have been used in statistical
inference problems, such as prediction, estimation and
modeling, see for instance [4], [5], [6], [7]. However, there
was no discussion of using data compression for classification
and clustering until very recently. In [8], [9], a "clustering
by compression" algorithm has been proposed. The approach in
[8], [9] differs from the approach in [3] in the way data
compression is used. In [8], [9], the data compression methods
are used to compute distances between data items. The clustering
results are then obtained by using conventional methods based
on the computed distances.

In this paper, we present an information theoretic analysis
to justify the intuitive idea of the blind signal classification
algorithm in [3]. It is shown that the blind signal classification
algorithm is equivalent to a Maximum A Posteriori (MAP)
estimator based on estimated parametric probability models.
We also discuss the error exponents of the model parameter
estimation. It is shown that the estimated model parameters
converge to the true model parameters in probability. These
theoretical results suggest that the blind signal classification
algorithm performs well.

The discussion in this paper focuses on the case where the
information sources are independent and identically distributed
(i.i.d.) Gaussian, and there are two information sources. Even
though more sophisticated cases are not covered in this paper,
the discussion presented here can provide useful insights
into these more general cases. The discussion can be easily
generalized to the case of multiple information sources.
With additional work, the results can also be generalized
to the cases of non-Gaussian, Markov, stationary or ergodic
information sources.

Notation: we use $\lfloor \cdot \rfloor$ to denote the floor
function, that is, $\lfloor x \rfloor$ is the largest integer not
greater than $x$. We use $\ln(\cdot)$ to denote the logarithmic
function with base $e$. We use $H(P)$ and $D(P\|Q)$ to denote the
entropy function and information divergence respectively. If $P$
and $Q$ are discrete probability mass functions, then

\[
H(P) = \sum_{x \in \mathcal{X}} -P(x)\ln(P(x)), \qquad
D(P\|Q) = \sum_{x \in \mathcal{X}} P(x)\ln\left(\frac{P(x)}{Q(x)}\right),
\tag{1}
\]



where $\mathcal{X}$ is the discrete alphabet. If $P$ and $Q$ are
probability density functions, then

\[
H(P) = \int -P(x)\ln(P(x))\,dx, \qquad
D(P\|Q) = \int P(x)\ln\left(\frac{P(x)}{Q(x)}\right)dx.
\tag{2}
\]



If $f(N)$ and $g(N)$ are two functions of the number $N$, we
use $f(N) \doteq g(N)$ to denote that

\[
\lim_{N \to \infty} \frac{1}{N}\ln\left(\frac{f(N)}{g(N)}\right) = 0.
\tag{3}
\]
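
As a quick check of the notation in (3), read here as the usual equality to first order in the exponent, a polynomial prefactor does not change the exponent. The constants $E > 0$ and $L \ge 0$ below are illustrative and are not taken from the paper.

```latex
f(N) = (N+1)^{L} e^{-NE}, \qquad g(N) = e^{-NE}
\;\Longrightarrow\;
\frac{1}{N}\ln\left(\frac{f(N)}{g(N)}\right) = \frac{L\ln(N+1)}{N}
\xrightarrow[N \to \infty]{} 0,
\qquad\text{so } f(N) \doteq g(N).
```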



For a sequence $x_1 x_2 \ldots x_N$ over a discrete alphabet
$\mathcal{X}$, the type of the sequence is defined as the
corresponding empirical distribution $P$ over $\mathcal{X}$, that
is, $P(a)$ is equal to the fraction of $x_i$ taking value $a$. For
a type $P$, the type class $T(P, N)$ is the set of sequences with
length $N$ and type $P$. We use $P(T(P, N))$ to denote the
probability that the type of the random sequence
$x_1 x_2 \ldots x_N$ is $P$.
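
The discrete definitions above (type, entropy, divergence) can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the toy sequence are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def sequence_type(seq, alphabet):
    """Empirical distribution (type) of seq over the given alphabet."""
    counts = Counter(seq)
    n = len(seq)
    return {a: counts.get(a, 0) / n for a in alphabet}


def entropy(p):
    """H(P) = -sum_x P(x) ln P(x) in nats; zero-probability terms add 0."""
    return -sum(px * math.log(px) for px in p.values() if px > 0)


def divergence(p, q):
    """D(P||Q) = sum_x P(x) ln(P(x)/Q(x)) in nats.

    Returns math.inf when Q(x) = 0 for some x with P(x) > 0.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total


# Toy example (illustrative values): the type of "aabab" over {a, b}.
seq = "aabab"
p_hat = sequence_type(seq, alphabet="ab")
uniform = {"a": 0.5, "b": 0.5}
print(p_hat, entropy(p_hat), divergence(p_hat, uniform))
```

For a uniform reference distribution over two symbols, the identity $D(P\|Q) = \ln 2 - H(P)$ gives a convenient sanity check.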

The rest of this paper is organized as follows. We review
the data compression based signal classification algorithm in
Section II. We discuss the necessary conditions for the optimal
solutions of the blind signal classification algorithm in Section
III. We discuss the error exponents of the parameter estimation
in Section IV. The concluding remarks are presented in
Section V.

II. Blind Signal Classification Algorithm 

In this paper, we consider the scenario where a sequence
of random real-valued signal samples $X_1, X_2, \ldots, X_N$ is
observed. Each signal sample $X_n$ is independently drawn from
one of several i.i.d. Gaussian information sources. The
probability density function is

\[
P(x) = \sum_{i=1}^{J} \frac{\alpha_i}{\sqrt{2\pi}\,\sigma_i}
\exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right),
\tag{4}
\]

where $J$ is the total number of information sources, $\alpha_i$
is the probability that $X_n$ is drawn from the $i$-th information
source, and $\mu_i$, $\sigma_i$ are the Gaussian distribution
parameters of the $i$-th information source. The goal is to
estimate the membership of each signal sample $X_n$ to the $J$
information sources.
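
To make the observation model concrete, the following sketch evaluates a Gaussian mixture density and draws samples from it together with their hidden source labels. The parameter values and function names are hypothetical, chosen only for illustration; in the blind setting the labels are not available to the classifier.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def mixture_pdf(x, alphas, mus, sigmas):
    """Mixture density: sum_i alpha_i * N(x; mu_i, sigma_i^2)."""
    return sum(a * gaussian_pdf(x, m, s)
               for a, m, s in zip(alphas, mus, sigmas))


def sample_mixture(n, alphas, mus, sigmas, rng):
    """Draw n samples; return (samples, labels). Labels are the hidden truth."""
    samples, labels = [], []
    for _ in range(n):
        i = rng.choices(range(len(alphas)), weights=alphas)[0]
        labels.append(i)
        samples.append(rng.gauss(mus[i], sigmas[i]))
    return samples, labels


# Illustrative two-source setup (assumed values, not from the paper).
ALPHAS = (0.3, 0.7)
MUS = (-2.0, 2.0)
SIGMAS = (1.0, 0.5)

xs, hidden_labels = sample_mixture(5, ALPHAS, MUS, SIGMAS, random.Random(0))
print(xs, hidden_labels)
```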

The blind signal classification algorithm in [3] is based on a
data compression argument that accurate signal classification
results in accurate signal modeling and efficient data
compression. Therefore, the signal samples should be classified so
that the coding efficiency is maximized. Let $m_{ni}$ denote the
membership variable for the $n$-th signal sample with respect
to the $i$-th class,

\[
m_{ni} =
\begin{cases}
1, & \text{if } x_n \text{ is classified into the } i\text{-th class} \\
0, & \text{otherwise}
\end{cases}
\tag{5}
\]

The algorithm searches for the optimal values $m_{ni}$ such that
the following objective function is minimized,



(6) 



where $\alpha_i$ is the fraction of signal samples that are
classified into the $i$-th class, $\sigma_i^2$ is the variance of
signal samples in the $i$-th class, and $H(\alpha_1, \ldots, \alpha_J)$
is the entropy function in nats.



N 



(7) 



The objective function $G$ relates to the so-called classification
gain and adaptive coding efficiency [3].

In [3], the hard membership variables $m_{ni}$ are relaxed
into soft membership variables. That is, $m_{ni}$ can take real
values, such that $\sum_i m_{ni} = 1$, $0 \le m_{ni} \le 1$. It is
proven in [3] that the optimality loss due to the relaxation
vanishes asymptotically.
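
The per-class statistics induced by a soft membership matrix can be sketched as follows. Two things here are assumptions rather than quotations from [3]: the weighted-average definitions of $\alpha_i$, $\mu_i$, $\sigma_i^2$, and the form $\ln G = \sum_i \alpha_i \ln(\sigma_i^2) + 2H(\alpha_1, \ldots, \alpha_J)$, which is chosen only because it is consistent with the derivative used in the proof of Theorem 3.1. All names are hypothetical.

```python
import math


def class_statistics(xs, memberships):
    """Weighted class fractions, means and variances.

    memberships[n][i] lies in [0, 1] and each row sums to 1.
    ASSUMPTION: statistics are membership-weighted averages.
    """
    n = len(xs)
    num_classes = len(memberships[0])
    alphas, mus, variances = [], [], []
    for i in range(num_classes):
        w = [row[i] for row in memberships]
        total = sum(w)
        if total == 0:
            raise ValueError("class %d has no mass" % i)
        mu = sum(wi * x for wi, x in zip(w, xs)) / total
        var = sum(wi * (x - mu) ** 2 for wi, x in zip(w, xs)) / total
        alphas.append(total / n)
        mus.append(mu)
        variances.append(var)
    return alphas, mus, variances


def log_objective(xs, memberships):
    """ASSUMED form: ln G = sum_i alpha_i ln(var_i) + 2 H(alpha), in nats.

    Requires every class variance to be strictly positive.
    """
    alphas, _, variances = class_statistics(xs, memberships)
    h = -sum(a * math.log(a) for a in alphas if a > 0)
    return sum(a * math.log(v) for a, v in zip(alphas, variances)) + 2.0 * h


# Toy data: two well separated groups (illustrative values).
xs = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]
good = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
bad = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]
print(log_objective(xs, good), log_objective(xs, bad))
```

Under the assumed objective, the membership assignment that matches the two groups yields a much smaller value than the interleaved one, which is the behaviour the compression argument suggests.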

III. Necessary Condition for Optimization Solutions

In this section, we show a necessary condition for the 
solution in the above blind signal classification method with 
soft membership variables. It turns out that useful insights can 
be gained from the necessary condition. 

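Let us assume that the probability density function of $X_n$
is $P^*(x) = \alpha_1^* f_1^*(x) + \alpha_2^* f_2^*(x)$, where

\[
f_i^*(x) = \frac{1}{\sqrt{2\pi}\,\sigma_i^*}
\exp\left(-\frac{(x-\mu_i^*)^2}{2(\sigma_i^*)^2}\right).
\tag{8}
\]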



Let $\hat{m}_{ni}$ denote one global minimizer of the optimization
problem in the blind signal classification method. Let
$\hat{\alpha}_1, \hat{\alpha}_2, \hat{\mu}_1, \hat{\mu}_2,
\hat{\sigma}_1, \hat{\sigma}_2$ denote the distribution parameters
corresponding to the optimization solution $\hat{m}_{ni}$.

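Theorem 3.1: The optimal solution $\hat{m}_{ni}$ satisfies the
following condition,

\[
\hat{m}_{n1} =
\begin{cases}
1, & \text{if } \hat{\alpha}_1 \hat{f}_1(x_n) > \hat{\alpha}_2 \hat{f}_2(x_n) \\
0, & \text{if } \hat{\alpha}_1 \hat{f}_1(x_n) < \hat{\alpha}_2 \hat{f}_2(x_n)
\end{cases}
\tag{9}
\]

where $\hat{f}_1$ and $\hat{f}_2$ are the Gaussian probability
density functions corresponding to the parameters $\hat{\mu}_1$,
$\hat{\mu}_2$, $\hat{\sigma}_1$, $\hat{\sigma}_2$,

\[
\hat{f}_i(x) = \frac{1}{\sqrt{2\pi}\,\hat{\sigma}_i}
\exp\left(-\frac{(x-\hat{\mu}_i)^2}{2\hat{\sigma}_i^2}\right).
\tag{10}
\]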

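Proof: (sketch) Consider $G$ as a function solely determined
by $m_{n1}$. Taking a derivative, we have

\[
\begin{aligned}
\frac{\partial(\ln(G))}{\partial m_{n1}}
&= \frac{1}{N}\left[\ln(\hat{\sigma}_1^2) - 2\ln(\hat{\alpha}_1)
 + \frac{(x_n-\hat{\mu}_1)^2}{\hat{\sigma}_1^2}\right]
 - \frac{1}{N}\left[\ln(\hat{\sigma}_2^2) - 2\ln(\hat{\alpha}_2)
 + \frac{(x_n-\hat{\mu}_2)^2}{\hat{\sigma}_2^2}\right] \\
&= \frac{2}{N}\ln\left(
 \frac{\sqrt{2\pi}\,\hat{\alpha}_2 \hat{f}_2(x_n)}
      {\sqrt{2\pi}\,\hat{\alpha}_1 \hat{f}_1(x_n)}\right).
\end{aligned}
\tag{11}
\]

Therefore,

\[
\frac{\partial \ln G}{\partial m_{n1}} < 0,
\quad \text{if } \hat{\alpha}_1 \hat{f}_1(x_n) > \hat{\alpha}_2 \hat{f}_2(x_n),
\tag{12}
\]
\[
\frac{\partial \ln G}{\partial m_{n1}} > 0,
\quad \text{if } \hat{\alpha}_1 \hat{f}_1(x_n) < \hat{\alpha}_2 \hat{f}_2(x_n).
\tag{13}
\]

The theorem then follows from the KKT condition [10]. ■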
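Corollary 3.2: Define three subsets $A$, $B$, $C$ of real numbers
as follows,
$A = \{x \mid \hat{\alpha}_1 \hat{f}_1(x) > \hat{\alpha}_2 \hat{f}_2(x)\}$,
$B = \{x \mid \hat{\alpha}_1 \hat{f}_1(x) = \hat{\alpha}_2 \hat{f}_2(x)\}$,
$C = \{x \mid \hat{\alpha}_1 \hat{f}_1(x) < \hat{\alpha}_2 \hat{f}_2(x)\}$.
We write $n \in A$ ($n \in B$, $n \in C$) if $x_n \in A$
($x_n \in B$, $x_n \in C$). Then, the following statements hold.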
a2



N 



(14) 
(15)



(16) 



(17) 
Remark 1: Theorem 3.1 shows that the data compression
based signal classification method is equivalent to the MAP
estimation based on the estimated parametric probability
models. Even though the probability model estimation is just a
by-product of the classification algorithm, the accuracy of model
estimation is critical to the performance of the algorithm.
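
The decision rule in Theorem 3.1 is the familiar MAP rule for a Gaussian mixture: assign $x_n$ to the class with the larger $\alpha_i f_i(x_n)$. A minimal sketch follows, working with logarithms to avoid underflow. The function names and parameter values are hypothetical; on a tie (the set $B$) the theorem leaves the membership undetermined, and this sketch simply returns the lowest index.

```python
import math


def log_score(x, alpha, mu, sigma):
    """ln(alpha * f(x)) up to the common constant -0.5*ln(2*pi)."""
    return math.log(alpha) - math.log(sigma) - (x - mu) ** 2 / (2.0 * sigma ** 2)


def map_classify(x, alphas, mus, sigmas):
    """Index of the class maximizing alpha_i * f_i(x); ties go to lowest index."""
    scores = [log_score(x, a, m, s) for a, m, s in zip(alphas, mus, sigmas)]
    return max(range(len(scores)), key=scores.__getitem__)


# Illustrative estimated parameters (assumed values, not from the paper).
alphas_hat = (0.5, 0.5)
mus_hat = (-2.0, 2.0)
sigmas_hat = (1.0, 1.0)

for x in (-3.0, -0.5, 0.5, 3.0):
    print(x, map_classify(x, alphas_hat, mus_hat, sigmas_hat))
```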

IV. Error Exponent of Parameter Estimation 

In this section, we investigate the accuracy of the parametric
probability model estimation in the proposed signal
classification method by using the method of types [11]. We need to
introduce several auxiliary discrete probability distributions.
Let $M_N$, $L_N$, $W_N$ be numbers which depend only on the
number of signal samples $N$,



Mn = c{N) 



L]y = 2 



\osN 



+ 1, Wn = 



2M 



N 



Ln 



(19) 



where $c$, $\zeta$, $\eta$ are positive constants, and
$\zeta + \eta < 1/2$. Let $a_k$ denote the number of signal
samples which fall in the interval
$[(k-1/2)W_N, (k+1/2)W_N]$. Let $O_N$ denote the random event
that $|x_n| > M_N$ for some $n$, $1 \le n \le N$. If $O_N$ does
not occur, then $\{\ldots, a_k/N, \ldots\}$ is a well defined
empirical probability mass function, where

\[
-\frac{L_N - 1}{2} \le k \le \frac{L_N - 1}{2}.
\tag{20}
\]

We write $k \in A$ if the interval
$[(k-1/2)W_N, (k+1/2)W_N] \subset A$. We write $k \in C$ if
$[(k-1/2)W_N, (k+1/2)W_N] \subset C$. We write $k \in B$ if
$[(k-1/2)W_N, (k+1/2)W_N] \cap B \neq \emptyset$. If $k \in B$,
we define $c_k$ as

\[
c_k = \frac{\sum_{n=1}^{N} \hat{m}_{n1}\,
 I\left(x_n \in [(k-1/2)W_N, (k+1/2)W_N]\right)}
 {\sum_{n=1}^{N} I\left(x_n \in [(k-1/2)W_N, (k+1/2)W_N]\right)},
\tag{21}
\]

where $I(\cdot)$ is the indicator function.
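
The quantization step can be sketched as follows: given a clipping level and an odd number of bins, count how many samples fall into each interval of width $W_N$ centred at $kW_N$. The sketch takes `m_n` and `l_n` as plain inputs rather than computing them from $N$, and assumes $W_N = 2M_N/L_N$ with $k$ ranging over $-(L_N-1)/2, \ldots, (L_N-1)/2$; the function name is hypothetical.

```python
import math


def bin_counts(xs, m_n, l_n):
    """Counts a_k of samples in [(k-1/2)W, (k+1/2)W], with W = 2*m_n/l_n.

    ASSUMPTIONS: l_n is odd and k runs over -(l_n-1)/2 .. (l_n-1)/2.
    Raises ValueError if some |x| > m_n (the event O_N), since the
    empirical pmf over the bins is then not well defined.
    """
    if l_n % 2 == 0:
        raise ValueError("l_n must be odd")
    if any(abs(x) > m_n for x in xs):
        raise ValueError("event O_N occurred: a sample exceeds m_n")
    width = 2.0 * m_n / l_n
    half = (l_n - 1) // 2
    counts = {k: 0 for k in range(-half, half + 1)}
    for x in xs:
        k = math.floor(x / width + 0.5)
        k = max(-half, min(half, k))  # x == m_n sits on the outer edge
        counts[k] += 1
    return counts, width


# Toy example with 5 bins on [-1, 1] (illustrative values).
samples = [-0.9, -0.1, 0.0, 0.1, 0.95]
counts, width = bin_counts(samples, m_n=1.0, l_n=5)
empirical_pmf = {k: c / len(samples) for k, c in counts.items()}
print(counts, width, empirical_pmf)
```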

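Let $\theta$ denote the 6-tuple
$\{\alpha_1, \alpha_2, \mu_1, \mu_2, \sigma_1, \sigma_2\}$. We use
$P(x; \theta)$ to denote the mixture Gaussian distribution,

\[
P(x; \theta) = \sum_{i=1}^{2} \frac{\alpha_i}{\sqrt{2\pi}\,\sigma_i}
\exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right).
\tag{22}
\]

We use $P_N(k; \theta)$ to denote the following discrete probability
distribution over the same alphabet as $(\ldots, a_k/N, \ldots)$,

\[
P_N(k; \theta) = c_p \int_{(k-1/2)W_N}^{(k+1/2)W_N}
\left(\sum_{i=1}^{2} \frac{\alpha_i}{\sqrt{2\pi}\,\sigma_i}
\exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right)\right) dx,
\tag{23}
\]

where $c_p$ is a normalization constant, $c_p \to 1$ as
$N \to \infty$.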
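
The discretized model $P_N(k; \theta)$ can be evaluated in closed form from the Gaussian CDF. In the sketch below, $c_p$ is taken to be the reciprocal of the mixture mass on $[-M_N, M_N]$, which is an assumption that makes the bin probabilities sum to one; the names and parameter values are illustrative only.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def discretized_mixture(alphas, mus, sigmas, m_n, l_n):
    """Bin probabilities of the mixture on l_n bins covering [-m_n, m_n].

    ASSUMPTION: c_p = 1 / (mixture mass inside [-m_n, m_n]).
    Returns (pmf dict keyed by k, c_p).
    """
    width = 2.0 * m_n / l_n
    half = (l_n - 1) // 2
    raw = {}
    for k in range(-half, half + 1):
        lo, hi = (k - 0.5) * width, (k + 0.5) * width
        raw[k] = sum(a * (normal_cdf(hi, m, s) - normal_cdf(lo, m, s))
                     for a, m, s in zip(alphas, mus, sigmas))
    c_p = 1.0 / sum(raw.values())
    return {k: c_p * v for k, v in raw.items()}, c_p


# Illustrative parameters (assumed values).
pmf, c_p = discretized_mixture((0.3, 0.7), (-2.0, 2.0), (1.0, 0.5),
                               m_n=10.0, l_n=101)
print(c_p, max(pmf, key=pmf.get))
```

With a clipping level far in the tails, almost no mass is lost and $c_p$ is numerically indistinguishable from one, in line with $c_p \to 1$.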
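Lemma 4.1:

\[
P(O_N) \le \sum_{i=1}^{2} \left[
\exp\left(-\frac{\left(cN^{1/2+\zeta} + \mu_i^*\right)^2}{2(\sigma_i^*)^2} + \ln N\right)
+ \exp\left(-\frac{\left(cN^{1/2+\zeta} - \mu_i^*\right)^2}{2(\sigma_i^*)^2} + \ln N\right)
\right].
\tag{24}
\]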

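Proof:

\[
\begin{aligned}
P(O_N) &\overset{(a)}{\le} \sum_{n=1}^{N} P(|X_n| > M_N) \\
&= \sum_{n=1}^{N} \sum_{i=1}^{2}
 P(|X_n| > M_N \mid x_n \text{ is of class } i)\,
 P(x_n \text{ is of class } i) \\
&\le \sum_{n=1}^{N} \sum_{i=1}^{2}
 P(|X_n| > M_N \mid x_n \text{ is of class } i) \\
&= N \sum_{i=1}^{2} \left[
 Q\left(\frac{M_N + \mu_i^*}{\sigma_i^*}\right)
 + Q\left(\frac{M_N - \mu_i^*}{\sigma_i^*}\right)\right] \\
&\overset{(b)}{\le} \sum_{i=1}^{2} \left[
 \exp\left(-\frac{\left(cN^{1/2+\zeta} + \mu_i^*\right)^2}{2(\sigma_i^*)^2} + \ln N\right)
 + \exp\left(-\frac{\left(cN^{1/2+\zeta} - \mu_i^*\right)^2}{2(\sigma_i^*)^2} + \ln N\right)
 \right],
\end{aligned}
\tag{25}
\]

where $Q(\cdot)$ denotes the well known Gaussian tail function, (a)
follows from the union bound, and (b) follows from the well
known Chernoff bound $Q(x) \le \exp(-x^2/2)$ (see for example
[12, Section 2-1-5]) applied with $M_N = cN^{1/2+\zeta}$. ■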
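
The two ingredients of the proof, the union bound and the Chernoff bound on the Gaussian tail, can be checked numerically. The sketch below takes the clipping level `m_n` as a plain input and assumes it exceeds every $|\mu_i|$, so that the tail arguments are nonnegative; names and values are hypothetical.

```python
import math


def q_function(x):
    """Gaussian tail Q(x) = P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def union_bound(n, m_n, alphas, mus, sigmas):
    """N * sum_i alpha_i * P(|X| > m_n | class i)."""
    per_sample = sum(
        a * (q_function((m_n - mu) / s) + q_function((m_n + mu) / s))
        for a, mu, s in zip(alphas, mus, sigmas))
    return n * per_sample


def chernoff_style_bound(n, m_n, mus, sigmas):
    """sum_i [exp(-(m_n+mu_i)^2/(2 s_i^2) + ln n) + exp(-(m_n-mu_i)^2/(2 s_i^2) + ln n)].

    ASSUMPTION: m_n > |mu_i| for all i, so Q(x) <= exp(-x^2/2) applies.
    """
    total = 0.0
    for mu, s in zip(mus, sigmas):
        for shifted in (m_n + mu, m_n - mu):
            total += math.exp(-shifted ** 2 / (2.0 * s ** 2) + math.log(n))
    return total


# Illustrative setup (assumed values).
n, m_n = 100, 6.0
alphas, mus, sigmas = (0.5, 0.5), (-2.0, 2.0), (1.0, 1.0)
exact_union = union_bound(n, m_n, alphas, mus, sigmas)
loose = chernoff_style_bound(n, m_n, mus, sigmas)
print(exact_union, loose)
```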
Lemma 4.2: Let $\hat{\theta} = \{\hat{\alpha}_1, \hat{\alpha}_2,
\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}_1, \hat{\sigma}_2\}$ be the
estimated model parameters. Assume that the random event
$O_N$ does not occur. Then, the following bound holds, which
relates the type $\{a_k/N\}$ to the estimated probability model
parameters.



Proof: (sketch) The bound is proved in Eq. (27), where (a)
follows from the fact that $\ln(\cdot)$ is an increasing function,
(b) follows from the mean-value theorem, and (c) follows from
Eqs. [n] [IS 

■ 

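Theorem 4.3: Let $\mathcal{D}$ denote a set of mixture Gaussian
distributions with parameters
$\{\alpha_1, \alpha_2, \mu_1, \mu_2, \sigma_1, \sigma_2\}$, where
$\sigma_i^2$ are lower bounded by a positive constant $B_\sigma$.
Assume that the true model distribution
$\theta^* \notin \mathcal{D}$,
$\theta^* = \{\alpha_1^*, \alpha_2^*, \mu_1^*, \mu_2^*, \sigma_1^*, \sigma_2^*\}$.
Define the error exponent
$E_r(\mathcal{D}) = \lim_{N \to \infty} -\frac{1}{N}\ln P(\hat{\theta} \in \mathcal{D})$.
Let $\Gamma(\mathcal{D})$ denote the set of probability
distributions with well-defined probability density function
$P(x)$, such that there exists a $P(x; \tilde{\theta})$,
$\tilde{\theta} \in \mathcal{D}$,
$\tilde{\theta} = \{\tilde{\alpha}_1, \tilde{\alpha}_2, \tilde{\mu}_1,
\tilde{\mu}_2, \tilde{\sigma}_1, \tilde{\sigma}_2\}$, and

\[
D(P(x)\|P(x; \tilde{\theta})) + H(P(x))
\le \frac{\tilde{\alpha}_1}{2}\ln(2\pi e\tilde{\sigma}_1^2)
+ \frac{\tilde{\alpha}_2}{2}\ln(2\pi e\tilde{\sigma}_2^2)
+ H(\tilde{\alpha}_1, \tilde{\alpha}_2).
\tag{28}
\]

Then $E_r(\mathcal{D}) \ge E_L(\mathcal{D})$, where

\[
E_L(\mathcal{D}) = \min_{P(x) \in \Gamma(\mathcal{D})} D(P(x)\|P(x; \theta^*)).
\tag{29}
\]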



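Proof: (sketch) According to Lemma 4.1, the exponent of the
probability $P(O_N)$ is infinity. Therefore,

\[
P(\hat{\theta} \in \mathcal{D})
= P(\hat{\theta} \in \mathcal{D}, O_N^c)
+ P(\hat{\theta} \in \mathcal{D} \mid O_N) P(O_N)
\doteq P(\hat{\theta} \in \mathcal{D}, O_N^c).
\tag{30}
\]

Let $\Gamma(\mathcal{D}, \epsilon)$ denote the set of probability
distributions with probability density function $P(x)$, such that
there exists $P(x; \tilde{\theta})$, $\tilde{\theta} \in \mathcal{D}$,
$\tilde{\theta} = \{\tilde{\alpha}_1, \tilde{\alpha}_2, \tilde{\mu}_1,
\tilde{\mu}_2, \tilde{\sigma}_1, \tilde{\sigma}_2\}$, and

\[
D(P(x)\|P(x; \tilde{\theta})) + H(P(x))
\le \frac{\tilde{\alpha}_1}{2}\ln(2\pi e\tilde{\sigma}_1^2)
+ \frac{\tilde{\alpha}_2}{2}\ln(2\pi e\tilde{\sigma}_2^2)
+ H(\tilde{\alpha}_1, \tilde{\alpha}_2) + \epsilon.
\tag{31}
\]

Let $\mathcal{G}(\mathcal{D}, N, \epsilon)$ denote the set of types
$P$ of sequences with length $N$, such that there exists
$P(x; \tilde{\theta})$, $\tilde{\theta} \in \mathcal{D}$,
$\tilde{\theta} = \{\tilde{\alpha}_1, \tilde{\alpha}_2, \tilde{\mu}_1,
\tilde{\mu}_2, \tilde{\sigma}_1, \tilde{\sigma}_2\}$, and

\[
D(P\|P_N(k; \tilde{\theta})) + H(P)
\le \sum_{i=1}^{2} \frac{\tilde{\alpha}_i}{2}\ln(2\pi e\tilde{\sigma}_i^2)
+ H(\tilde{\alpha}_1, \tilde{\alpha}_2) - \ln W_N + \epsilon.
\tag{32}
\]

According to Lemma 4.2, if $\hat{\theta} \in \mathcal{D}$ and $O_N$
does not occur, then the type
$\{a_k/N\} \in \mathcal{G}(\mathcal{D}, N, \epsilon)$. Therefore,

\[
\begin{aligned}
P(\hat{\theta} \in \mathcal{D}, O_N^c)
&\le \sum_{P \in \mathcal{G}(\mathcal{D}, N, \epsilon)} P(T(P, N)) \\
&\overset{(a)}{\doteq} \max_{P \in \mathcal{G}(\mathcal{D}, N, \epsilon)} P(T(P, N)) \\
&\overset{(b)}{\doteq} \exp\left(-\min_{P \in \mathcal{G}(\mathcal{D}, N, \epsilon)}
 D(P\|P_N(k; \theta^*))\,N\right),
\end{aligned}
\tag{33}
\]

where (a) follows from the fact that the number of type classes
is upper bounded by

\[
(N + 1)^{L_N - 1},
\tag{34}
\]

and (b) follows from first principles in the method of types
[11].

Let $\{b_k/N\}$ denote the above type which minimizes
$D(P\|P_N(k; \theta^*))$. We can construct a probability
distribution with probability density function
$Q(x; N, b_k/N)$ as follows,

\[
Q(x; N, b_k/N) =
\begin{cases}
\dfrac{b_k}{N W_N}, & \text{if } |x| \le M_N,\ |x - kW_N| \le \dfrac{W_N}{2} \\
0, & \text{otherwise}
\end{cases}
\tag{35}
\]

It can be checked that
$Q(x; N, b_k/N) \in \Gamma(\mathcal{D}, \epsilon_1)$, and

\[
D(b_k/N\|P_N(k; \theta^*)) \ge D(Q(x; N, b_k/N)\|P(x; \theta^*)) - \epsilon_2,
\tag{36}
\]

where $\epsilon_1, \epsilon_2$ are some small positive numbers,
$\epsilon_1, \epsilon_2 \to 0$ as $N \to \infty$.

As a consequence,

\[
\min_{P \in \mathcal{G}(\mathcal{D}, N, \epsilon)} D(P\|P_N(k; \theta^*))
\ge \min_{P(x) \in \Gamma(\mathcal{D}, \epsilon_1)} D(P(x)\|P(x; \theta^*)) - \epsilon_2.
\tag{37}
\]

Finally, the theorem follows from the fact that all information
divergence and entropy functions are continuous. ■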
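
The "first principles in the method of types" used in step (b) can be illustrated on a binary alphabet, where the probability of a type class is an exact binomial term. The sketch checks the standard sandwich $(N+1)^{-2} e^{-N D(P\|Q)} \le P(T(P,N)) \le e^{-N D(P\|Q)}$ for every type of length $N$. The binary setting and the values of `n` and `q` are illustrative assumptions, not part of the paper's construction.

```python
import math


def binary_divergence(p, q):
    """D(P||Q) in nats for Bernoulli(p) against Bernoulli(q), 0 < q < 1."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def type_class_probability(n, k, q):
    """Probability that an i.i.d. Bernoulli(q) sequence of length n has k ones."""
    return math.comb(n, k) * q ** k * (1.0 - q) ** (n - k)


def check_type_bounds(n, q):
    """Return True if both type-class bounds hold for every type of length n."""
    for k in range(n + 1):
        prob = type_class_probability(n, k, q)
        upper = math.exp(-n * binary_divergence(k / n, q))
        lower = upper / (n + 1) ** 2
        # Small relative slack absorbs floating-point rounding.
        if not (lower * (1 - 1e-9) <= prob <= upper * (1 + 1e-9)):
            return False
    return True


bounds_hold = check_type_bounds(n=50, q=0.3)
print(bounds_hold)
```

Because the number of types grows only polynomially in $N$ here, summing over types changes the probability by at most a polynomial factor, which is why the sum and the maximum in (33) share the same exponent.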
Theorem 4.4: For sufficiently large $N$, with probability
close to one,

\[
D(P(x; \theta^*)\|P(x; \hat{\theta})) \le H(\alpha_1^*, \alpha_2^*) + \epsilon,
\tag{38}
\]

where $\epsilon$ is a small positive number, $\epsilon \to 0$ as
$N \to \infty$.

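Proof: (sketch) We define $a_k/N$ and $Q(x; N, a_k/N)$
similarly as in the above. With probability close to one, $O_N$
does not occur, and

\[
D(P(x; \theta^*)\|P(x; \hat{\theta})) - D(Q(x; N, a_k/N)\|P(x; \hat{\theta}))
\le \epsilon_1.
\tag{39}
\]

Note that

\[
H(P(x; \theta^*)) \ge \frac{\alpha_1^*}{2}\ln\left(2\pi e(\sigma_1^*)^2\right)
+ \frac{\alpha_2^*}{2}\ln\left(2\pi e(\sigma_2^*)^2\right).
\tag{40}
\]

By Lemma 4.1 and Lemma 4.2, we have with probability close to one

\[
\begin{aligned}
D(P(x; \theta^*)\|P(x; \hat{\theta}))
\le{}& \frac{\hat{\alpha}_1}{2}\ln\left((\hat{\sigma}_1)^2\right)
+ \frac{\hat{\alpha}_2}{2}\ln\left((\hat{\sigma}_2)^2\right)
+ H(\hat{\alpha}_1, \hat{\alpha}_2) \\
&- \frac{\alpha_1^*}{2}\ln\left((\sigma_1^*)^2\right)
- \frac{\alpha_2^*}{2}\ln\left((\sigma_2^*)^2\right) + \epsilon_2.
\end{aligned}
\tag{41}
\]

The theorem then follows from the fact that

\[
\frac{\hat{\alpha}_1}{2}\ln\left((\hat{\sigma}_1)^2\right)
+ \frac{\hat{\alpha}_2}{2}\ln\left((\hat{\sigma}_2)^2\right)
+ H(\hat{\alpha}_1, \hat{\alpha}_2) = \min \frac{\ln(G)}{2},
\tag{42}
\]

and, with probability close to one,

\[
\frac{\hat{\alpha}_1}{2}\ln\left((\hat{\sigma}_1)^2\right)
+ \frac{\hat{\alpha}_2}{2}\ln\left((\hat{\sigma}_2)^2\right)
+ H(\hat{\alpha}_1, \hat{\alpha}_2)
\le \frac{\alpha_1^*}{2}\ln\left((\sigma_1^*)^2\right)
+ \frac{\alpha_2^*}{2}\ln\left((\sigma_2^*)^2\right)
+ H(\alpha_1^*, \alpha_2^*) + \epsilon_3.
\tag{43}
\]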



The derivation (27), referenced in the proof of Lemma 4.2:

\[
\begin{aligned}
& D\left(a_k/N\,\|\,P_N(k; \hat{\theta})\right) + H(a_k/N)
 = -\sum_{k} \frac{a_k}{N}\ln\left(P_N(k; \hat{\theta})\right) \\
&\overset{(a)}{\le}
 -\sum_{k \in A} \frac{a_k}{N}\ln\left(\int_{(k-1/2)W_N}^{(k+1/2)W_N}
 \hat{\alpha}_1 \hat{f}_1(x)\,dx\right)
 -\sum_{k \in C} \frac{a_k}{N}\ln\left(\int_{(k-1/2)W_N}^{(k+1/2)W_N}
 \hat{\alpha}_2 \hat{f}_2(x)\,dx\right) \\
&\quad -\sum_{k \in B} \frac{c_k a_k}{N}\ln\left(\int_{(k-1/2)W_N}^{(k+1/2)W_N}
 \hat{\alpha}_1 \hat{f}_1(x)\,dx\right)
 -\sum_{k \in B} \frac{(1-c_k) a_k}{N}\ln\left(\int_{(k-1/2)W_N}^{(k+1/2)W_N}
 \hat{\alpha}_2 \hat{f}_2(x)\,dx\right)
 - \ln(c_p) \\
&\overset{(b)}{\le} \cdots \\
&\overset{(c)}{=} \frac{\hat{\alpha}_1}{2}\ln(2\pi e\hat{\sigma}_1^2)
 + \frac{\hat{\alpha}_2}{2}\ln(2\pi e\hat{\sigma}_2^2)
 + H(\hat{\alpha}_1, \hat{\alpha}_2)
 + \left(\cdots\right) - \ln(W_N) - \ln(c_p).
\end{aligned}
\tag{27}
\]



In the above, $\epsilon_1, \epsilon_2, \epsilon_3$ are all small
positive real numbers, $\epsilon_1, \epsilon_2, \epsilon_3 \to 0$
as $N \to \infty$. ■
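
The two quantities compared in Theorem 4.4, the divergence between the true and estimated mixtures and the entropy of the true mixing weights, can be evaluated numerically. The sketch below uses a midpoint rule on a truncated interval. The parameter tuples are hypothetical stand-ins: `theta_est` is not the output of the classification algorithm, so the comparison only illustrates how the quantities are computed.

```python
import math


def mixture_pdf(x, alphas, mus, sigmas):
    """Two-component (or more) Gaussian mixture density."""
    return sum(
        a * math.exp(-(x - m) ** 2 / (2.0 * s ** 2)) / (math.sqrt(2.0 * math.pi) * s)
        for a, m, s in zip(alphas, mus, sigmas))


def mixture_kl(theta_p, theta_q, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule estimate of D(P||Q) in nats on [lo, hi].

    theta = (alphas, mus, sigmas). ASSUMPTION: both densities have
    negligible mass outside [lo, hi].
    """
    step = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * step
        p = mixture_pdf(x, *theta_p)
        if p > 0.0:
            q = mixture_pdf(x, *theta_q)
            if q == 0.0:
                return math.inf
            total += p * math.log(p / q) * step
    return total


def mixing_entropy(alphas):
    """H(alpha_1, ..., alpha_J) in nats."""
    return -sum(a * math.log(a) for a in alphas if a > 0)


# Hypothetical true and estimated parameters (illustrative values).
theta_true = ((0.3, 0.7), (-2.0, 2.0), (1.0, 1.0))
theta_est = ((0.35, 0.65), (-1.9, 2.1), (1.1, 0.9))

kl = mixture_kl(theta_true, theta_est)
print(kl, mixing_entropy(theta_true[0]))
```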
Remark 2: Theorems 4.3 and 4.4 show that the estimated
model parameters converge to the true model parameters in
probability, except for some bias terms. The bound in Theorem
4.3 can be further improved. However, because the discussion
is rather involved, we leave it for future research.

V. Conclusion 

In this paper, we present an information theoretic performance
analysis of the blind signal classification algorithm
proposed in [3]. We show that the classification results
obtained by the algorithm are equivalent to those of a MAP
estimator using the estimated parametric probability models. We
further show that the by-product model parameter estimation is
accurate. This theoretical analysis suggests that the algorithm
performs well.